 automated pipeline


APIGen: Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets

Neural Information Processing Systems

The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data point in our dataset is verified through three hierarchical stages: format checking, actual function execution, and semantic verification, improving its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models.
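
The three hierarchical stages named in the abstract can be sketched roughly as follows. This is a minimal illustration and not APIGen's implementation: the registry `API_REGISTRY`, the function names, and in particular the keyword heuristic standing in for semantic verification are all assumptions made for the example.

```python
import json

# Hypothetical API registry (an assumption for this sketch):
# name -> (callable, set of required parameter names).
API_REGISTRY = {
    "add": (lambda a, b: a + b, {"a", "b"}),
}


def check_format(raw):
    """Stage 1: the generated call must be valid JSON with the expected shape."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    name, args = call.get("name"), call.get("arguments")
    if not isinstance(name, str) or name not in API_REGISTRY:
        return None
    if not isinstance(args, dict) or set(args) != API_REGISTRY[name][1]:
        return None
    return call


def check_execution(call):
    """Stage 2: actually run the function; any exception fails the sample."""
    func, _ = API_REGISTRY[call["name"]]
    try:
        return True, func(**call["arguments"])
    except Exception:
        return False, None


def check_semantics(query, call, result):
    """Stage 3: stand-in semantic check (a keyword heuristic, assumed here)."""
    return result is not None and call["name"] in query.lower()


def verify(query, raw):
    """Run the stages in order and stop at the first failure."""
    call = check_format(raw)
    if call is None:
        return "rejected: format"
    ok, result = check_execution(call)
    if not ok:
        return "rejected: execution"
    if not check_semantics(query, call, result):
        return "rejected: semantic"
    return "accepted"
```

Ordering the checks from cheapest to most expensive means a malformed sample never reaches the execution stage, and only samples that execute cleanly reach the semantic check.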


An Automated Pipeline for Few-Shot Bird Call Classification: A Case Study with the Tooth-Billed Pigeon

Jana, Abhishek, Uili, Moeumu, Atherton, James, O'Brien, Mark, Wood, Joe, Brickson, Leandra

arXiv.org Artificial Intelligence

This paper presents an automated one-shot bird call classification pipeline designed for rare species absent from large publicly available classifiers like BirdNET and Perch. While these models excel at detecting common birds with abundant training data, they lack options for species with only 1-3 known recordings, a critical limitation for conservationists monitoring the last remaining individuals of endangered birds. To address this, we leverage the embedding space of large bird classification networks and develop a classifier using cosine similarity, combined with filtering and denoising preprocessing techniques, to optimize detection with minimal training data. We evaluate various embedding spaces using clustering metrics and validate our approach in both a simulated scenario with Xeno-Canto recordings and a real-world test on the critically endangered tooth-billed pigeon (Didunculus strigirostris), which has no existing classifiers and only three confirmed recordings. The final model achieved 1.0 recall and 0.95 accuracy in detecting tooth-billed pigeon calls, making it practical for use in the field. This open-source system provides a practical tool for conservationists seeking to detect and monitor rare species on the brink of extinction.
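
The cosine-similarity classifier described above might look like the following minimal sketch. It assumes the embeddings have already been extracted by some pretrained network. The function names, the toy 3-dimensional vectors, and the 0.8 threshold are illustrative assumptions, not values from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def detect(query_embedding, reference_embeddings, threshold=0.8):
    """Flag a clip if it is close enough to any of the few reference recordings.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    best = max(cosine_similarity(query_embedding, ref)
               for ref in reference_embeddings)
    return best >= threshold, best


# Toy stand-ins for embeddings of the 1-3 known recordings.
references = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
is_match, score = detect([2.0, 0.0, 0.0], references)
```

Because cosine similarity ignores vector magnitude, a louder or quieter rendition of the same call that maps to a scaled embedding still scores highly against the references.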


CoRe: An Automated Pipeline for The Prediction of Liver Resection Complexity from Preoperative CT Scans

Ali, Omar, Bone, Alexandre, Accardo, Caterina, Belkouchi, Omar, Rohe, Marc-Michel, Vibert, Eric, Vignon-Clementel, Irene

arXiv.org Artificial Intelligence

Surgical resections are the most prevalent curative treatment for primary liver cancer. Tumors located in critical positions are known to complicate liver resections (LR). While experienced surgeons in specialized medical centers may have the necessary expertise to accurately anticipate LR complexity and prepare accordingly, an objective method able to reproduce this behavior would have the potential to improve the standard routine of care and avoid intra- and postoperative complications. In this article, we propose CoRe, an automated medical image processing pipeline for the prediction of postoperative LR complexity from preoperative CT scans, using imaging biomarkers. The CoRe pipeline first segments the liver, lesions, and vessels with two deep learning networks. The liver vasculature is then pruned based on a topological criterion to define the hepatic central zone (HCZ), a convex volume circumscribing the major liver vessels, from which a new imaging biomarker, BHCZ, is derived. Additional biomarkers are extracted and leveraged to train and evaluate an LR complexity prediction model. An ablation study shows the HCZ-based biomarker to be the central feature in predicting LR complexity. The best predictive model reaches an accuracy, F1, and AUC of 77.3%, 75.4%, and 84.1%, respectively.


AutoCP: Automated Pipelines for Accurate Prediction Intervals

Zhang, Yao, Zame, William, van der Schaar, Mihaela

arXiv.org Machine Learning

Successful application of machine learning models to real-world prediction problems, e.g. financial forecasting and personalized medicine, has proved to be challenging, because such settings require limiting and quantifying the uncertainty in the model predictions, i.e. providing valid and accurate prediction intervals. Conformal Prediction is a distribution-free approach to construct valid prediction intervals in finite samples. However, the prediction intervals constructed by Conformal Prediction are often (because of over-fitting, inappropriate measures of nonconformity, or other issues) overly conservative and hence inadequate for the application(s) at hand. This paper proposes an AutoML framework called Automatic Machine Learning for Conformal Prediction (AutoCP). Unlike the familiar AutoML frameworks that attempt to select the best prediction model, AutoCP constructs prediction intervals that achieve the user-specified target coverage rate while optimizing the interval length to be accurate and less conservative. We tested AutoCP on a variety of datasets and found that it significantly outperforms benchmark algorithms.
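
For context, the baseline that the abstract builds on can be sketched as plain split conformal prediction with absolute-residual nonconformity scores. This is the standard textbook construction, not AutoCP itself. The function names and the `predict` callable are assumptions for the example.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    With too few calibration points for the requested coverage, the only
    valid interval is unbounded, so infinity is returned.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def split_conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Prediction interval for x_new with target coverage 1 - alpha.

    `predict` is any already-fitted point predictor (hypothetical here);
    the calibration data must not have been used to fit it.
    """
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q
```

The interval width here is the same for every input, which is one reason such intervals can be overly conservative. The abstract describes AutoCP as optimizing interval length while keeping the target coverage rate.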